# Multimodal Vision-Language

Qwen2.5 VL 7B Instruct Gemlite Ao A8w8

This is a multimodal large language model quantized with A8W8, based on Qwen2.5-VL-7B-Instruct, supporting vision and language tasks.
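
"A8W8" conventionally means 8-bit activations and 8-bit weights. The listing does not say which 8-bit format or scaling scheme this checkpoint uses, so the following is only a minimal sketch of the general idea, assuming symmetric per-tensor integer quantization. The helper names `quantize_int8` and `int8_dot` are hypothetical and are not part of any library named here.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to the range [-127, 127].

    Returns the integer codes and the scale needed to map them back
    to approximate real values (value ~= code * scale).
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def int8_dot(activations: List[float], weights: List[float]) -> float:
    """Dot product with both operands quantized to 8 bits (the "A8W8" idea)."""
    qa, scale_a = quantize_int8(activations)
    qw, scale_w = quantize_int8(weights)
    # Accumulate in integers, then rescale once at the end.
    acc = sum(a * w for a, w in zip(qa, qw))
    return acc * scale_a * scale_w


if __name__ == "__main__":
    exact = 1.0 * 0.5 + (-0.5) * 0.25
    approx = int8_dot([1.0, -0.5], [0.5, 0.25])
    print(f"exact={exact:.4f} approx={approx:.4f}")
```

The small gap between the exact and approximate results is the quantization error that such schemes accept in exchange for lower memory use and faster matrix multiplication.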

Llava 1.5 13b Hf I1 GGUF

This project provides weighted/imatrix quantized versions of the llava-1.5-13b-hf model in a range of quantization types for different use cases.

Transformers English

Spaceqwen2.5 VL 3B Instruct I1 GGUF

SpaceQwen2.5-VL-3B-Instruct is a 3B-parameter vision-language model focused on spatial reasoning and multimodal tasks.

Text-to-Image English

VLM R1 Qwen2.5VL 3B OVD 0321

A zero-shot object detection model based on Qwen2.5-VL-3B-Instruct, enhanced with VLM-R1 reinforcement learning, supporting open vocabulary detection tasks.

Safetensors English

Eagle 2 is a high-performance vision-language model family that focuses on transparency in data strategies and training schemes, aiming to drive the open-source community in developing competitive vision-language models.

Transformers Other

Eagle2 is a high-performance vision-language model family introduced by NVIDIA, focusing on enhancing the performance of open-source vision-language models through data strategies and training approaches. Eagle2-2B is the lightweight model in this series, achieving outstanding efficiency and speed while maintaining robust performance.

Transformers Other

Minivla Libero90 Prismatic

MiniVLA is a 1-billion-parameter vision-language model compatible with the Prismatic Vision-Language Model codebase, suitable for robotics and multimodal tasks.

Transformers English

Paligemma2 28b Mix 224

PaliGemma 2 is an upgraded vision-language model released by Google that combines the Gemma 2 language model with the SigLIP vision model and supports multilingual image-text interaction tasks.

Paligemma2 28b Mix 448

PaliGemma 2 is a vision-language model based on Gemma 2, supporting image+text input and text output, suitable for various vision-language tasks.

Paligemma2 10b Pt 896

PaliGemma 2 is a vision-language model (VLM) released by Google that integrates Gemma 2 capabilities, supporting image and text input to generate text output.

Paligemma2 10b Pt 448

PaliGemma 2 is Google's upgraded vision-language model (VLM) that combines Gemma 2 capabilities, supporting image and text input to generate text output.

Paligemma2 3b Pt 448

PaliGemma 2 is a vision-language model based on Gemma 2, supporting image and text input to generate text output, suitable for various vision-language tasks.

Paligemma2 3b Pt 224

PaliGemma 2 is a vision-language model (VLM) developed by Google, combining the capabilities of the Gemma 2 language model and SigLIP vision model, supporting multilingual vision-language tasks.

Paligemma2 10b Mix 224

PaliGemma 2 is a vision-language model based on Gemma 2, supporting image and text input to generate text output, suitable for various vision-language tasks.

Paligemma2 3b Mix 448

PaliGemma 2 is a vision-language model based on Gemma 2, supporting image and text inputs with text generation output, suitable for various vision-language tasks.

Paligemma2 3b Ft Docci 448

PaliGemma 2 is an upgraded vision-language model released by Google that combines the Gemma 2 language model with the SigLIP vision model and supports multilingual vision-language tasks.

Llama 3.1 8B Dragonfly V2

Dragonfly is an instruction-tuned multimodal vision-language model based on Llama 3.1, supporting joint understanding and generation of images and text.

Image-to-Text English

togethercomputer

OpenVLA v0.1 7B is an open-source vision-language-action model trained on the Open X-Embodiment dataset, supporting control of a variety of robots.

Transformers English

Paligemma 3b Pt 448

PaliGemma is a lightweight and versatile vision-language model built on the SigLIP vision model and Gemma language model, supporting multilingual image-text interaction tasks.

Paligemma 3b Pt 896

PaliGemma is a versatile lightweight vision-language model (VLM) that supports image and text inputs and generates text outputs. It has multilingual capabilities.

Paligemma 3b Ft Refcoco Seg 896

PaliGemma is a lightweight vision-language model developed by Google, built upon the SigLIP vision model and Gemma language model, supporting multilingual text generation and visual understanding tasks.

Paligemma 3b Mix 224

PaliGemma is a versatile, lightweight vision-language model (VLM) built upon the SigLIP vision model and Gemma language model, supporting image and text inputs with text outputs.

Paligemma 3b Pt 224

PaliGemma is a versatile, lightweight vision-language model (VLM) built upon the SigLIP vision model and Gemma language model, capable of processing both image and text inputs to generate text outputs.
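
The PaliGemma and PaliGemma 2 entries above follow a regular naming pattern: parameter count, then a variant tag (`pt` for pre-trained, `mix` for fine-tuned on a mixture of tasks, `ft` for fine-tuned on a single named task), then the input image resolution in pixels. The sketch below splits a listing name into those fields. It assumes the names are formatted exactly as shown on this page; the `parse_listing_name` helper and the `PaliGemmaCheckpoint` type are hypothetical and exist only for illustration.

```python
import re
from typing import NamedTuple, Optional


class PaliGemmaCheckpoint(NamedTuple):
    generation: int        # 1 for PaliGemma, 2 for PaliGemma 2
    size: str              # parameter count as listed, e.g. "3b"
    variant: str           # "pt", "mix", or "ft"
    task: Optional[str]    # task name for "ft" checkpoints, else None
    resolution: int        # input image side length in pixels


# Matches names such as "Paligemma2 3b Ft Docci 448" (assumed page format).
_PATTERN = re.compile(
    r"^paligemma(?P<gen>2)?\s+(?P<size>\d+b)\s+(?P<variant>pt|mix|ft)"
    r"(?:\s+(?P<task>.+?))?\s+(?P<res>\d+)$",
    re.IGNORECASE,
)


def parse_listing_name(name: str) -> PaliGemmaCheckpoint:
    """Split a listing name into generation, size, variant, task, resolution."""
    match = _PATTERN.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognized PaliGemma listing name: {name!r}")
    task = match.group("task")
    return PaliGemmaCheckpoint(
        generation=2 if match.group("gen") else 1,
        size=match.group("size").lower(),
        variant=match.group("variant").lower(),
        task=task.lower() if task else None,
        resolution=int(match.group("res")),
    )


if __name__ == "__main__":
    for listing in ("Paligemma2 28b Mix 224", "Paligemma 3b Ft Refcoco Seg 896"):
        print(parse_listing_name(listing))
```

Parsing the names this way makes it straightforward to filter the list, for example to keep only the 448-pixel pre-trained checkpoints.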

Vitamin XL 384px

ViTamin-XL-384px is a large-scale vision-language model based on the ViTamin architecture, specifically designed for vision-language tasks, supporting high-resolution image processing and multimodal feature extraction.

Internvl 14B 224px

InternVL-14B-224px is a 14B-parameter vision-language foundation model supporting various vision-language tasks.

© 2025 AIbase